In this project, we visualize the dataset 2016collisionsfinal.csv, which represents collisions occurring on public roadways.
We import pandas to work with our data, Matplotlib to plot charts, Seaborn to make our charts prettier, and Plotly for interactive charts.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
color = sns.color_palette()
sns.set(style="darkgrid")
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, MinMaxScaler
Let's load 2016collisionsfinal.csv, which has been provided in the datasets for the course.
rawdf = pd.read_csv('2016collisionsfinal.csv')
rawdf.head()
# checking missing data
total = rawdf.isnull().sum().sort_values(ascending = False)
percent = (rawdf.isnull().sum()/rawdf.isnull().count()).sort_values(ascending = False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total missing ', 'Percent'])
missing_data
Since the missing values are a very small percentage of the data, we can safely drop the rows that contain them. We also rename the data frame to df:
df = rawdf.dropna(axis=0 , how='any')
df.head()
# comparing sizes of data frames
print("Old data frame length:", len(rawdf), "\nNew data frame length:",
len(df), "\nNumber of rows with at least 1 NA value: ",
(len(rawdf)-len(df)))
print('The dataset 2016collisionsfinal has {} rows and {} features'.format(df.shape[0],df.shape[1]))
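Before dropping rows, it can help to confirm how much data would actually be lost. The sketch below is not part of the original notebook: the helper name `drop_na_if_cheap`, the 5% threshold, and the toy frame are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def drop_na_if_cheap(frame, max_loss=0.05):
    """Drop rows with any missing value, but only if at most `max_loss`
    (an assumed threshold) of the rows would be lost; otherwise raise."""
    # fraction of rows that contain at least one missing value
    lost = frame.isnull().any(axis=1).mean()
    if lost > max_loss:
        raise ValueError(f"dropping NA rows would lose {lost:.1%} of the data")
    return frame.dropna(axis=0, how="any")


# toy data (hypothetical): 1 of 40 rows has a missing value
toy = pd.DataFrame({"a": list(range(40)), "b": [np.nan] + [1.0] * 39})
clean = drop_na_if_cheap(toy)
print(len(toy), len(clean))
```

With the real data, the same check could be applied to `rawdf` before assigning the result to `df`.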
We need to choose a list of variables (at most 7) that are crucial to a good understanding of the dataset:
We drop the first column, which is just the row index, i.e. a running count of the collision records.
We keep the variables Date, Time, Location, Light, X, Y, Impact_type, Traffic_Control, Collision_Location and Road_Surface.
From these we derive numeric helper columns: date_m (the first field of Date, used as the month), time (the hour), and the leading numeric codes of Road_Surface, Light and Traffic_Control.
We also remove rows with X above 1,000,000 and rows whose light code is 99.
df = rawdf.dropna(axis=0 , how='any')
# keep only the columns used in the analysis; copy to avoid SettingWithCopyWarning
df = df[['Date', 'Time', 'Location', 'Light', 'X', 'Y', 'Impact_type', 'Traffic_Control', 'Collision_Location', 'Road_Surface']].copy()
# first field of Date (used as the month) and hour of Time
df['date_m'] = df['Date'].str.split('/', expand=True)[0].astype(int)
df['time'] = df['Time'].str.split(':', expand=True)[0].astype(int)
# leading numeric code of each coded field
df['road_Surface'] = df['Road_Surface'].str.split('-', expand=True)[0].astype(int)
df['light'] = df['Light'].str.split('-', expand=True)[0].astype(int)
df['traffic_Control'] = df['Traffic_Control'].str.split('-', expand=True)[0].astype(int)
# strip thousands separators before converting the coordinates to float
df['X'] = df['X'].str.replace(',', '', regex=False).astype(float)
df['Y'] = df['Y'].str.replace(',', '', regex=False).astype(float)
# drop rows with X above 1,000,000 and rows whose light code is 99
df = df.drop(index=df[df['X'] > 1000000].index)
df = df.drop(index=df[df['light'] == 99].index)
df.head()
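The parsing step relies on each coded field starting with a numeric code followed by a hyphen, and on coordinates being stored as strings with thousands separators. The sample values below are invented to illustrate that assumption; the real file may be formatted differently.

```python
import pandas as pd

# toy rows (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "Light": ["1 - Daylight", "7 - Dark"],
    "Time": ["8:15", "23:40"],
    "X": ["367,211.5", "368,004.0"],
})

# text before the first hyphen is the numeric code
toy["light"] = toy["Light"].str.split("-", expand=True)[0].str.strip().astype(int)
# text before the first colon is the hour
toy["time"] = toy["Time"].str.split(":", expand=True)[0].astype(int)
# remove the thousands separator as a literal comma, then convert to float
toy["X"] = toy["X"].str.replace(",", "", regex=False).astype(float)

print(toy[["light", "time", "X"]])
```

Passing `regex=False` makes the comma a literal match regardless of the pandas version's default for `str.replace`.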
In this question, we create 4 multivariate visualizations of the dataset using the variables listed in question 1.
from plotly import graph_objects as go
# count collisions per collision location, from most to least frequent
loc_counts = df["Collision_Location"].value_counts()
fig = go.Figure(go.Funnel(
    y=loc_counts.index.tolist(),
    x=loc_counts.values.tolist()))
fig.show()
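The per-location counts used for the funnel chart can be computed either with an explicit loop over the unique labels or with `value_counts`. The sketch below checks, on invented labels, that both approaches give the same numbers; the label strings are assumptions, not values from the dataset.

```python
import pandas as pd

# toy data (hypothetical labels)
toy = pd.DataFrame({"Collision_Location": [
    "Intersection", "Midblock", "Intersection", "Ramp", "Intersection"]})

# explicit loop: one filtered row count per distinct label
labels = toy["Collision_Location"].unique().tolist()
loop_counts = {lab: int((toy["Collision_Location"] == lab).sum()) for lab in labels}

# the same counts in one call, sorted from most to least frequent
vc = toy["Collision_Location"].value_counts()

print(loop_counts)
print(vc)
```

The descending order returned by `value_counts` also suits a funnel chart, which is usually read from the largest stage to the smallest.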
# count collisions per month, ordered by month number
month_counts = df["date_m"].value_counts().sort_index()
fig = go.Figure(data=go.Scatter(x=month_counts.index.tolist(),
                                y=month_counts.values.tolist(),
                                mode='markers'))
fig.update_layout(title='Number of accidents per month')
fig.show()
Notes:
import plotly.express as px
paral_data = df[['date_m','time','Light','Collision_Location','Traffic_Control']]
fig = px.parallel_categories(paral_data)
fig.show()
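In a parallel-categories plot, the width of each ribbon reflects how many rows share a given combination of categories. For two variables, the same joint counts can be inspected as a table. The sketch below uses `pd.crosstab` on invented category values, which are assumptions and not taken from the dataset.

```python
import pandas as pd

# toy data (hypothetical category values)
toy = pd.DataFrame({
    "Light": ["Daylight", "Dark", "Daylight", "Daylight", "Dark"],
    "Traffic_Control": ["Signal", "Signal", "Stop sign", "Signal", "No control"],
})

# joint frequency of every (Light, Traffic_Control) pair
table = pd.crosstab(toy["Light"], toy["Traffic_Control"])
print(table)
```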
Notes:
# scatter of the collision coordinates, coloured by impact type
px.scatter(df, x='X', y='Y',
           color='Impact_type',
           width=900, height=400,
           title="Relationship between X, Y and Impact_type")
Notes:
Question 3 : “definitive” visualizations for the dataset.
plt.figure(figsize=(16,6))
sorted_ = df.groupby(['Light'])['date_m'].mean()
sns.boxplot(x=df['date_m'], y=df['Light'], order=list(sorted_.index))
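Note that `groupby(...).mean()` returns the groups in label order, not ordered by their mean, so the `order` list above follows the alphabetical order of the Light labels. If the intent is to order the boxes by mean month, one possible sketch (on invented values) adds `sort_values()` before taking the index:

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame({
    "Light": ["Dark", "Dark", "Daylight", "Daylight", "Dusk"],
    "date_m": [11, 12, 6, 7, 3],
})

# group means come back in label order
by_label = toy.groupby("Light")["date_m"].mean().index.tolist()

# sorting the means first gives an order by mean value instead
by_mean = toy.groupby("Light")["date_m"].mean().sort_values().index.tolist()

print(by_label)
print(by_mean)
```

The resulting list can be passed as the `order` argument of `sns.boxplot`.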
# light code against road surface code, coloured by collision location
px.scatter(df, x='light', y='road_Surface', color='Collision_Location', width=900, height=400,
           title="Relationship between light and road_Surface")
Notes:
This plot illustrates well the trend between life expectancy, GDP and gen_rating. All of these variables are positively correlated.
Jerusalem is detected as an outlier: it is the only city in the sufficiency group with high life expectancy and GDP.